Method of synthesizing creaky voice

ABSTRACT

The invention relates to a method of synthesizing a signal comprising the steps of: a) providing of a first signal having first periods of a first type and second periods of a second type in an alternating sequence, b) selecting of one of the pitch bells for a first one of the required pitch bell locations by identifying the nearest neighboring period of the first one of the required pitch bell locations being of the first type, and selecting of the pitch bell of the identified period, c) selecting of one of the pitch bells for a second one of the required pitch bell locations by identifying a nearest neighboring period of the second one of the required pitch bell locations having the second type, and selecting the pitch bell of the identified period, whereby the steps b) and c) are carried out for all of the required pitch bell locations.

The present invention relates to the field of speech synthesis, and more particularly, without limitation, to the field of text-to-speech synthesis.

The function of a text-to-speech (TTS) synthesis system is to synthesize speech from a generic text in a given language. Nowadays, TTS systems have been put into practical operation for many applications, such as access to databases through the telephone network or aid to handicapped people. One method to synthesize speech is by concatenating elements of a recorded set of subunits of speech such as demisyllables or polyphones. The majority of successful commercial systems employ the concatenation of polyphones.

The polyphones comprise groups of two (diphones), three (triphones) or more phones and may be determined from nonsense words, by segmenting the desired grouping of phones at stable spectral regions. In a concatenation-based synthesis, the conservation of the transition between two adjacent phones is crucial to assure the quality of the synthesized speech. With the choice of polyphones as the basic subunits, the transition between two adjacent phones is preserved in the recorded subunits, and the concatenation is carried out between similar phones.

Before the synthesis, however, the phones must have their duration and pitch modified in order to fulfil the prosodic constraints of the new words containing those phones. This processing is necessary to avoid the production of a monotonous sounding synthesized speech. In a TTS system, this function is performed by a prosodic module. To allow the duration and pitch modifications in the recorded subunits, many concatenation-based TTS systems employ the time-domain pitch-synchronous overlap-add (TD-PSOLA) model of synthesis (E. Moulines and F. Charpentier, “Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones,” Speech Commun., vol. 9, pp. 453-467, 1990).

When a signal is to be synthesized with an increased duration by means of a known PSOLA method, each of the pitch bells is repeated a number of times corresponding to the desired increase of the duration. For example, if the duration is to be doubled, each period of the original signal is repeated. When this approach is applied to creaky voice, the resulting synthesized signal sounds unnatural and the creaky character of the voice is lost.
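The effect described above can be illustrated with a minimal sketch. It works on period labels only, not on audio, and the function name and the labels "strong" and "weak" are illustrative assumptions rather than part of any known PSOLA implementation.

```python
def repeat_each_period(labels, factor):
    """Conventional lengthening: repeat every period `factor` times in place."""
    return [label for label in labels for _ in range(factor)]


# Hypothetical creaky interval with alternating strong and weak periods.
original = ["strong", "weak", "strong", "weak"]

# Doubling the duration repeats each period once, so equal periods become
# adjacent and the strict strong/weak alternation is no longer present.
stretched = repeat_each_period(original, 2)
print(stretched)
```

With a factor of 2 the sequence becomes strong, strong, weak, weak, and so on, which is the loss of alternation that the invention addresses.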

The present invention therefore aims to provide an improved method of synthesizing a signal which enables creaky voice to be synthesized. Further, the present invention aims to provide a corresponding computer program product and computer system, in particular a text-to-speech system.

The present invention provides a method of synthesizing a signal having alternating strong and weak periods, as is the case for creaky voice.

Creaky voice is often found at the end of a sentence where the pitch of a speaker is at its low end. Creaky voice is characterized by irregularity of pitch-period durations. One common version of creaky voice has alternating strong and weak periods. The present invention is based on the discovery that when a prior art PSOLA-type method is applied to synthesize a signal having an increased duration, the alternation of the strong and weak periods is lost and an unnatural sounding amplitude variation is therefore added to the synthesized speech. The invention makes it possible to preserve such a creaky voice characteristic in the synthesized signal.

In accordance with a preferred embodiment of the invention, the strong and the weak periods of an original creaky voice sound signal are classified by marking the periods with different class-types. This information is used to make an alternating choice between the strong and the weak periods. By choosing nearest neighboring periods for the selection of pitch bells, the form of the signal envelope is also preserved in the synthesized signal having the increased duration.

The present invention is particularly advantageous for text-to-speech synthesis systems. In accordance with a preferred embodiment of the invention, such a text-to-speech synthesis system contains a data file for storing classification information of the original sound signal. By means of this classification information, creaky voice intervals having alternating strong and weak periods are identified.

This classification information can be generated by means of a computer program, which analyses the original signal in order to detect the characteristics of creaky voice within the signal. Alternatively, this classification can be performed by a human expert. It is to be noted that the classification is only to be performed once; after the initial classification, an unlimited number of signals of a variety of durations can be synthesized without further interaction.
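As a rough sketch of how such an automatic classification might look, the following assumes that one peak amplitude per period is already available and applies the amplitude rule given in the description of FIG. 1 (a weak period has a lower amplitude than the immediately preceding period). The function name, the labels, and the handling of the first period, which has no predecessor, are assumptions made for illustration.

```python
def classify_creaky_periods(peak_amplitudes):
    """Label each period 'strong' or 'weak' by comparison with its predecessor."""
    labels = []
    for i, amplitude in enumerate(peak_amplitudes):
        if i == 0:
            # Assumption: the first period has no predecessor, so it is
            # compared with the following period instead (if there is one).
            following = peak_amplitudes[1] if len(peak_amplitudes) > 1 else amplitude
            labels.append("strong" if amplitude >= following else "weak")
        else:
            previous = peak_amplitudes[i - 1]
            labels.append("strong" if amplitude > previous else "weak")
    return labels


# Hypothetical per-period peak amplitudes of a creaky interval.
print(classify_creaky_periods([1.0, 0.4, 0.9, 0.3]))
```

The result would be stored once, for example in the data file mentioned above, and reused for every later synthesis.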

In the following, preferred embodiments of the invention are described in greater detail by making reference to the drawings, in which:

FIG. 1 is illustrative of a sound signal containing creaky voice and a synthesized signal having an increased duration,

FIG. 2 is a flow chart of an embodiment of a method of the invention, and

FIG. 3 is a block diagram of a preferred embodiment of a computer system.

FIG. 1 shows an original signal 100 having a duration of 0.07 seconds. The periods of the original signal 100 are classified as ‘v’, ‘e’ or ‘o’: the classifier ‘v’ identifies periods of type ‘voiced’; the classifiers ‘e’ and ‘o’ identify periods which are of type ‘creaky’, whereby ‘e’ designates strong periods and ‘o’ designates weak periods. In this context ‘weak’ means that the amplitude within that period of the creaky voice interval is lower than the amplitude of the immediately preceding period; likewise ‘strong’ means that the amplitude of that period of the creaky voice interval is higher than the amplitude of the immediately preceding period of the creaky voice interval. This classification of the original signal 100 can be performed by means of a computer program which analyses the original signal 100 in order to identify the above-described signal characteristics. Alternatively, this classification can also be performed manually by a human expert. It is preferred that the classification is performed in a first step by means of a computer program and is then reviewed in a second step by a human expert for improved precision of the classification. Original signal 100 and its classification serve as a basis to generate synthesized signal 102. The synthesized signal 102 is required to have a duration of about 0.16 seconds, which is about twice the duration of the original signal 100. In order to synthesize the signal 102 with this required duration, pitch bell locations j are determined on the time axis 104 in the domain of the synthesized signal 102. The pitch bell locations j are spaced on the time axis 104 by the period p as given by the fundamental frequency of the signal to be synthesized. It is to be noted that the signal to be synthesized can have the same or another pitch/fundamental frequency as the original signal.
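The spacing of the required pitch bell locations can be sketched as follows. The function name is hypothetical, the fundamental frequency of 100 Hz is an assumed example value that does not come from FIG. 1, and placing location j=1 at time zero is likewise an assumption.

```python
def required_pitch_bell_locations(duration_s, f0_hz):
    """Return the times (in seconds) of the required pitch bell locations.

    The locations are spaced by the period p = 1 / f0 of the signal to be
    synthesized. List index 0 corresponds to location j=1.
    """
    period_s = 1.0 / f0_hz
    count = round(duration_s * f0_hz)  # number x of required locations
    return [j * period_s for j in range(count)]


# Required duration of about 0.16 s at an assumed 100 Hz fundamental.
locations = required_pitch_bell_locations(0.16, 100.0)
print(len(locations), locations[:3])
```

Because the period p is derived from the target fundamental frequency, the same function covers both cases mentioned above: the synthesized signal may keep the original pitch or use another one.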
The first required pitch bell location j=1 is of type ‘e’, as is the case for the first period e1 of the creaky voice sound interval within the original signal 100. As a consequence, a pitch bell is obtained from the period e1 of the original signal 100 by means of windowing. The following required pitch bell location j=2 requires a pitch bell of type ‘o’, as the synthesis of creaky voice requires alternating strong and weak periods. In order to also maintain the form of the signal envelope within the creaky voice sound interval within original signal 100, a pitch bell is obtained from the nearest neighboring period of type ‘o’ within the original signal 100, which is period o1. The following required pitch bell location j=3 again requires a pitch bell of type ‘e’. This pitch bell is obtained from the period categorized as ‘e’ within the original signal 100 which is the nearest neighbor to the required pitch bell location j=3. This nearest neighbor is the period e1 within original signal 100. This means that a pitch bell is obtained for pitch bell location j=3 by windowing period e1 of the original signal 100.

Likewise, the consecutive pitch bell location j=4 needs to be of type ‘o’. Again the closest period of that type within original signal 100 is selected in order to obtain a pitch bell. This closest period of the required type is the period o1. This process is performed with respect to all required pitch bell locations j on time axis 104 in order to obtain a pitch bell for each of the required pitch bell locations.
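The selection for a single required location can be sketched as below. The text does not state how a location on the synthesized time axis is compared with periods of the original signal; this sketch assumes a linear time scaling by the duration ratio. The period centre times (in milliseconds), the scale factor of 2 and all names are hypothetical values chosen to reproduce the sequence e1, o1, e1, o1 described above.

```python
def select_period(location_ms, required_type, periods, time_scale):
    """Return the index of the nearest original period of the required type.

    periods: list of (centre_time_ms, type_label) tuples of the original signal.
    time_scale: assumed ratio of synthesized duration to original duration.
    """
    source_time_ms = location_ms / time_scale  # assumption: linear mapping
    candidates = [i for i, (_, label) in enumerate(periods) if label == required_type]
    if not candidates:
        raise ValueError(f"no period of type {required_type!r} in the original")
    return min(candidates, key=lambda i: abs(periods[i][0] - source_time_ms))


# Hypothetical creaky interval: e1, o1, e2, o2 with a 10 ms period.
names = ["e1", "o1", "e2", "o2"]
periods = [(5.0, "e"), (15.0, "o"), (25.0, "e"), (35.0, "o")]

# Required locations j=1..8 with alternating required types, duration doubled.
chosen = []
for j in range(1, 9):
    required_type = "e" if j % 2 == 1 else "o"
    location_ms = 5.0 + 10.0 * (j - 1)
    chosen.append(names[select_period(location_ms, required_type, periods, 2.0)])
print(chosen)  # ['e1', 'o1', 'e1', 'o1', 'e2', 'o2', 'e2', 'o2']
```

In this example each pair of original periods is used twice in a row, so the alternation is kept while the selection still advances through the original signal and follows its envelope.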

The resulting pitch bells are then overlapped and added in order to synthesize the required signal 102 containing synthesized creaky voice with an increased duration. The resulting synthesized signal 102 has a sequence of alternating strong and weak periods, as is the case in the original signal 100, in order to maintain this aspect of the original signal characteristic. Because nearest neighboring periods of the required category are always selected from the original signal 100 for obtaining the pitch bells, the form of the signal envelope of the creaky part of the original signal 100 is also preserved. The result is a natural sounding synthesized signal 102 having all of the characteristics of the original creaky voice sound signal but with an increased duration.
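A minimal sketch of windowing and of the overlap and add operation, working on plain Python lists of samples, is given below. The Hann window, the window length of two periods and all function names are assumptions; the text only specifies that pitch bells are obtained by windowing and are then overlapped and added.

```python
import math


def hann(length):
    """Symmetric Hann window with `length` samples (length >= 2)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1)) for i in range(length)]


def extract_pitch_bell(signal, centre, half_width):
    """Window 2 * half_width + 1 samples around `centre`, zero-padding at the edges."""
    window = hann(2 * half_width + 1)
    bell = []
    for k in range(-half_width, half_width + 1):
        index = centre + k
        sample = signal[index] if 0 <= index < len(signal) else 0.0
        bell.append(sample * window[k + half_width])
    return bell


def overlap_add(bells, centres, length):
    """Add each pitch bell into the output, centred on its required location."""
    output = [0.0] * length
    for bell, centre in zip(bells, centres):
        half_width = len(bell) // 2
        for k, value in enumerate(bell):
            index = centre - half_width + k
            if 0 <= index < length:
                output[index] += value
    return output


# Sanity check on a constant signal: Hann bells spaced by half their width
# sum back to the constant in the interior of the output.
signal = [1.0] * 40
centres = list(range(4, 37, 4))
bells = [extract_pitch_bell(signal, 20, 4) for _ in centres]
result = overlap_add(bells, centres, 40)
```

In an actual synthesis, `bells` would hold the pitch bells selected for the required locations and `centres` the sample positions of those locations.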

FIG. 2 shows a corresponding flow chart. In step 200 an original sound signal is provided. The original sound signal contains at least one interval containing creaky voice. In step 202 creaky voice sound periods are identified and classified. This can be done manually, by means of a computer program or with the assistance of a computer program. To retain the naturalness of the creak, the strong and weak periods are marked with different class-types and this information is used to make an alternating choice between the strong and weak periods. Strong (even) periods are marked by type ‘1’ and weak (odd) periods are marked by type ‘−1’. In step 204 pitch bells are obtained from the original sound signal by means of windowing. The windowing operation is performed by means of windows which are positioned synchronously with the fundamental frequency of the original sound. In step 206 the required pitch bell locations j in the time domain of the signal to be synthesized are determined. If the signal to be synthesized is required to have a certain duration, this implies that a number x of pitch bell locations spaced apart by the period p is required, where the number x is greater than the number of periods contained in the original signal. In step 208 the index j is initialized to be equal to 1. In step 210 the index t is initialized to be equal to 1. The index t indicates the type, which is either ‘1’ or ‘−1’. In step 212 a pitch bell is selected for the pitch bell location j in the time domain of the signal to be synthesized. This selection is performed by searching for the nearest neighbor of pitch bell location j in the time domain of the original signal which has the required type t. This way a pitch bell of type t is selected from the nearest neighbor of pitch bell location j in the time domain of the original signal. In step 214 the index j is incremented in order to go to the next pitch bell location j.
In step 216 the type parameter t is multiplied by −1 in order to switch the required type; on the first pass this changes the required type to the category ‘weak’. As a consequence, in the following step 212 a nearest neighbor for the consecutive pitch bell location j which is of type ‘−1’ is selected from the domain of the original signal. Steps 212, 214 and 216 are repeatedly carried out until pitch bells have been selected for all of the required pitch bell locations j. After this selection process has been completed, an overlap and add operation is performed; the resulting signal contains creaky voice and has the required duration.
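The loop over steps 208 to 216 can be sketched as follows, with the step numbers of FIG. 2 given in the comments. The function name, the data layout and the linear mapping of a required location to the time axis of the original signal are assumptions; the flow chart itself does not prescribe them.

```python
def select_pitch_bells(period_times, period_types, locations, time_scale):
    """Return, for each required location, the index of the selected period.

    period_times: centre times of the original periods.
    period_types: +1 for strong periods, -1 for weak periods.
    locations: required pitch bell locations in the synthesized signal.
    time_scale: assumed ratio of synthesized to original duration.
    """
    selected = []
    t = 1                                    # step 210: start with type '1'
    for location in locations:               # steps 208 and 214: j = 1, 2, ...
        source_time = location / time_scale  # assumption: linear mapping
        candidates = [i for i, kind in enumerate(period_types) if kind == t]
        if not candidates:
            raise ValueError("original signal lacks a period of the required type")
        # step 212: nearest neighboring period of the required type t
        nearest = min(candidates, key=lambda i: abs(period_times[i] - source_time))
        selected.append(nearest)
        t *= -1                              # step 216: switch the required type
    return selected


# Hypothetical original: strong, weak, strong, weak at 5, 15, 25, 35 ms.
times_ms = [5.0, 15.0, 25.0, 35.0]
types = [1, -1, 1, -1]
required_ms = [5.0 + 10.0 * j for j in range(8)]
indices = select_pitch_bells(times_ms, types, required_ms, 2.0)
print(indices)  # [0, 1, 0, 1, 2, 3, 2, 3]
```

The selected indices then identify the pitch bells that are passed to the overlap and add operation.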

FIG. 3 shows a block diagram of a computer system 300, such as a text-to-speech system. The computer system 300 has a module 302 for storing a recording of an original sound signal comprising a creaky voice sound interval. Module 304 serves to store sound classification information, i.e. the classifiers ‘v’, ‘e’ and ‘o’ as illustrated in the example of FIG. 1. Module 306 serves for windowing of the original sound signal in order to obtain pitch bells. Module 308 serves to determine the required pitch bell locations in the domain of the signal to be synthesized. This is done based on the required length y of the signal to be synthesized and the required fundamental frequency of the signal to be synthesized, which may or may not be equal to the fundamental frequency of the original sound signal. Module 310 serves for selection of pitch bells which are obtained from module 306. The pitch bells are selected in accordance with steps 212, 214 and 216 as illustrated in FIG. 2. This means that creaky voice is obtained by creating a sequence of alternating strong and weak periods while preserving the form of the signal envelope of the original sound. Module 312 serves to perform an overlap and add operation on the pitch bells selected by module 310. This way the required synthesized signal is obtained.

1. A method of synthesizing a signal comprising the steps of: a) providing of a first signal having first periods of a first type and second periods of a second type in an alternating sequence, b) windowing of the first signal to provide a pitch bell for each of the first and second periods, c) determining a number of required pitch bell locations for a second signal to be synthesized, d) selecting of one of the pitch bells for a first one of the required pitch bell locations by identifying the nearest neighboring period of the first one of the required pitch bell locations being of the first type, and selecting of the pitch bell of the identified period, e) selecting of one of the pitch bells for a second one of the required pitch bell locations by identifying a nearest neighboring period of the second one of the required pitch bell locations having the second type, and selecting the pitch bell of the identified period, whereby the steps d) and e) are carried out for all of the required pitch bell locations, f) performing an overlap and add operation on the selected pitch bells in order to synthesize the second signal.
2. The method of claim 1, the first signal having alternating strong and weak periods of substantially the same signal form.
3. The method of claims 1 or 2, the first signal being a creaky voice signal.
4. The method of claims 1, 2 or 3, whereby the required pitch bell locations are determined in order to increase the duration of the second signal to be synthesized.
5. A computer program product, in particular digital storage medium, comprising program means for performing the steps of: a) providing of a first signal having first periods of a first type and second periods of a second type in an alternating sequence, b) windowing of the first signal to provide a pitch bell for each of the first and second periods, c) determining a number of required pitch bell locations for a second signal to be synthesized, d) selecting of one of the pitch bells for a first one of the required pitch bell locations by identifying the nearest neighboring period of the first one of the required pitch bell locations being of the first type, and selecting of the pitch bell of the identified period, e) selecting of one of the pitch bells for a second one of the required pitch bell locations by identifying a nearest neighboring period of the second one of the required pitch bell locations having the second type, and selecting the pitch bell of the identified period, whereby the steps d) and e) are carried out for all of the required pitch bell locations, f) performing an overlap and add operation on the selected pitch bells in order to synthesize the second signal.
6. The computer program product of claim 5, the program means being adapted to determine the required pitch bell locations in accordance with a required duration of the second signal to be synthesized.
7. A computer system, in particular text-to-speech synthesis system, comprising: means for providing of a first signal having first periods of a first type and second periods of a second type in an alternating sequence, means for windowing of the first signal to provide a pitch bell for each of the first and second periods, means for determining a number of required pitch bell locations for a second signal to be synthesized, means for selecting of one of the pitch bells for a first one of the required pitch bell locations by identifying the nearest neighboring period of the first one of the required pitch bell locations being of the first type, and selecting of the pitch bell of the identified period, and for selecting of one of the pitch bells for a second one of the required pitch bell locations by identifying a nearest neighboring period of the second one of the required pitch bell locations having the second type, and selecting the pitch bell of the identified period, means for performing an overlap and add operation on the selected pitch bells in order to synthesize the second signal.
8. The computer system of claim 7, further comprising means for storing of classification data for identifying first and second periods of the first signal.
9. A synthesized signal comprising a number of pitch bells which are overlapped and added, the pitch bells being of first and second types, the first and second types having substantially the same signal form and varying amplitudes, the pitch bells being selected to form an alternating sequence of first and second type pitch bells.